Mixtape: Breaking the Softmax Bottleneck Efficiently
Yang, Zhilin, Luong, Thang, Salakhutdinov, Russ R., Le, Quoc V.
The softmax bottleneck has been shown to limit the expressiveness of neural language models. Mixture of Softmaxes (MoS) is an effective approach to address this theoretical limitation, but it is expensive compared to softmax in terms of both memory and time. We propose Mixtape, an output layer that breaks the softmax bottleneck more efficiently with three novel techniques: logit space vector gating, sigmoid tree decomposition, and gate sharing. On four benchmarks covering language modeling and machine translation, the Mixtape layer substantially improves efficiency over the MoS layer by 3.5x to 10.5x while obtaining similar performance. A network equipped with Mixtape is only 20% to 34% slower than a softmax-based network with vocabulary sizes of 10K to 30K, and it outperforms softmax in perplexity and translation quality.
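To make the abstract's three techniques concrete, here is a minimal NumPy sketch of a Mixtape-style output layer for a single context vector, based only on the high-level description above: per-token gates are mixed in logit space (rather than mixing K full softmax distributions as in MoS), and the gate priors come from a sigmoid tree decomposition with K-1 sigmoids for K components. The parameter names, shapes, and gate parameterization are hypothetical simplifications, and gate sharing for infrequent tokens is omitted; this is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def mixtape_output(g, H, W, U, b):
    """Toy Mixtape-style output layer for one context vector.

    g : (d,)         context vector from the network
    H : (K, d1, d)   component projections; h_k = tanh(H_k @ g)
    W : (V, d1)      output token embeddings
    U : (V, K-1, d)  per-token gate parameters (hypothetical parameterization)
    b : (V, K-1)     per-token gate biases
    Returns a (V,) probability distribution from a single softmax.
    """
    K = H.shape[0]
    assert K == 4, "this sketch hard-codes a K = 4 sigmoid tree"

    # 1) K component context embeddings (used for logit-space mixing,
    #    instead of computing K separate softmaxes as MoS does)
    h = np.tanh(H @ g)                                    # (K, d1)

    # 2) K-1 sigmoid gates per token, combined by the sigmoid tree
    #    decomposition into K gate priors that sum to 1 per token
    s = sigmoid(np.einsum('vnd,d->vn', U, g) + b)         # (V, K-1)
    pi = np.stack([
        s[:, 0] * s[:, 1],                 # component 1
        s[:, 0] * (1 - s[:, 1]),           # component 2
        (1 - s[:, 0]) * s[:, 2],           # component 3
        (1 - s[:, 0]) * (1 - s[:, 2]),     # component 4
    ], axis=1)                                            # (V, K)

    # 3) Vector gating in logit space: mix per-component logits per token,
    #    then apply a single softmax over the vocabulary
    component_logits = W @ h.T                            # (V, K)
    logits = (pi * component_logits).sum(axis=1)          # (V,)
    return softmax(logits)

# Tiny usage example with random parameters
rng = np.random.default_rng(0)
d, d1, V, K = 8, 8, 20, 4
probs = mixtape_output(rng.standard_normal(d),
                       rng.standard_normal((K, d1, d)),
                       rng.standard_normal((V, d1)),
                       0.1 * rng.standard_normal((V, K - 1, d)),
                       np.zeros((V, K - 1)))
print(probs.shape, probs.sum())  # (20,) 1.0
```

Because the mixing happens on scalar logits rather than on K vocabulary-sized probability vectors, only one softmax normalization is needed, which is where the claimed memory and time savings over MoS come from.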
Reviews: Mixtape: Breaking the Softmax Bottleneck Efficiently
POST-AUTHOR FEEDBACK: I thank the authors for their feedback and clarifications. I have increased my score based on those answers and on the expectation that the promised modifications will appear in the final version. I would strongly encourage the authors to make the released code as easy to use as possible, ideally with plugins for major platforms. This would not only increase citations but also have a direct impact on a number of use cases.
ORIGINAL REVIEW: This paper addresses the softmax bottleneck problem: resolving it has been shown to significantly improve results when the output is over a large space (e.g., in NLP). However, current solutions are very costly.
This paper proposes techniques to deal with the softmax bottleneck problem.
Pros:
• Experimental results show strong performance in language modeling and machine translation.
Cons:
• The writing of the paper could be improved by making it more self-contained.
The paper represents solid work. There are clarity issues pointed out by the reviewers.